A rigorous statistical framework for spatio-temporal pollution prediction and estimation of its long-term impact on health
In the United Kingdom, air pollution is linked to around 40,000 premature deaths each year, but estimating its health effects is challenging in a spatio-temporal study. The challenges include spatial misalignment between the pollution and disease data; uncertainty in the estimated pollution surface; and complex residual spatio-temporal autocorrelation in the disease data. This article develops a two-stage model that addresses these issues. The first stage is a spatio-temporal fusion model linking modeled and measured pollution data, while the second stage links these predictions to the disease data. The methodology is motivated by a new five-year study investigating the effects of multiple pollutants on respiratory hospitalizations in England between 2007 and 2011, using pollution and disease data relating to local and unitary authorities on a monthly time scale.
On Bayesian "central clustering": Application to landscape classification of Western Ghats
Landscape classification of the well-known biodiversity hotspot, Western
Ghats (mountains), on the west coast of India, is an important part of a
world-wide program of monitoring biodiversity. To this end, a massive
vegetation data set, consisting of 51,834 4-variate observations has been
clustered into different landscapes by Nagendra and Gadgil [Current Sci. 75
(1998) 264--271]. But a study of such importance may be affected by
nonuniqueness of cluster analysis and the lack of methods for quantifying
uncertainty of the clusterings obtained. Motivated by this applied problem of
much scientific importance, we propose a new methodology for obtaining the
global, as well as the local modes of the posterior distribution of clustering,
along with the desired credible and "highest posterior density" regions in a
nonparametric Bayesian framework. To meet the need of an appropriate metric for
computing the distance between any two clusterings, we adopt and provide a much
simpler, but accurate modification of the metric proposed in [In Felicitation
Volume in Honour of Prof. B. K. Kale (2009) MacMillan]. A very fast and
efficient Bayesian methodology, based on [Sankhyā Ser. B 70 (2008)
133--155], has been utilized to solve the computational problems associated
with the massive data and to obtain samples from the posterior distribution of
clustering on which our proposed methods of summarization are illustrated.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS454 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports
Bug reports are an essential aspect of software development, and it is
crucial to identify and resolve them quickly to ensure the consistent
functioning of software systems. Retrieving similar bug reports from an
existing database can help reduce the time and effort required to resolve bugs.
In this paper, we compared the effectiveness of semantic textual similarity
methods for retrieving similar bug reports based on a similarity score. We
explored several embedding models such as TF-IDF (Baseline), FastText, Gensim,
BERT, and ADA. We used the Software Defects Data containing bug reports for
various software projects to evaluate the performance of these models. Our
experimental results showed that BERT generally outperformed the rest of the
models regarding recall, followed by ADA, Gensim, FastText, and TF-IDF. Our
study provides insights into the effectiveness of different embedding methods
for retrieving similar bug reports and highlights the impact of selecting the
appropriate one for this task. Our code is available on GitHub.
Comment: 7 Page
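As a rough illustration of the retrieval task described in this abstract, the sketch below ranks stored bug reports against a new report by cosine similarity of TF-IDF vectors, the baseline the abstract names. It is not the authors' code: the function names (`tfidf_vectors`, `rank_similar`), the whitespace tokenizer, the smoothed IDF formula, and the three sample reports are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, purely for illustration.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for docs, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant; other weightings are possible).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_similar(query, reports, k=3):
    """Return up to k (report_index, score) pairs, most similar first."""
    vectors, idf = tfidf_vectors(reports)
    q_tf = Counter(tokenize(query))
    # Query terms unseen in the stored reports carry no weight.
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]


# Hypothetical bug reports, invented for this example.
reports = [
    "app crashes when saving file",
    "login button does not respond",
    "crash on file save dialog",
]
query = "crashes when saving a file"
ranked = rank_similar(query, reports, k=3)
for index, score in ranked:
    print(index, round(score, 3), reports[index])
```

Because TF-IDF matches exact tokens, "crash" and "crashes" count as unrelated terms here; the third sample report is matched only through the shared word "file". Embedding models such as those compared in the abstract are one way to capture that kind of near-match.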